
perf: reduce HashMap/collection allocation overhead in gateway path#48662

Draft
xinlian12 wants to merge 15 commits into Azure:main from xinlian12:perf/hashmap-collection-allocation

Conversation


@xinlian12 xinlian12 commented Apr 1, 2026

Performance: Reduce HashMap/Collection Allocation Overhead in Gateway Path

Motivation

JFR profiling of the baseline (main) under high-concurrency gateway workloads revealed that HashMap-related allocations (HashMap$Node, HashMap, HashMap$ValueIterator) and HTTP header collections (DefaultHeaders$HeaderEntry, HttpHeader) are responsible for a significant share of total object allocation churn.

Baseline JFR allocation profile (c128 Read HTTP/1, ObjectAllocationSample, 10-min recording):

| Class | % of Total Allocation |
| --- | --- |
| HashMap$Node | 6.9% |
| DefaultHeaders$HeaderEntry | 6.8% |
| HashMap$ValueIterator | 1.3% |
| HttpHeader | 0.9% |
| HashMap | 0.7% |
| HttpHeaders | 0.6% |
| HashMap$Node[] | 0.5% |
| **Total targeted** | **~10.9%** |

Root causes:

  1. HashMap<>() default initial capacity (16) forces 1-2 resize+rehash cycles for typical gateway responses with 20-30 headers, creating throwaway HashMap$Node[] arrays and re-hashed HashMap$Node entries
  2. StoreResponse constructor converts HttpHeaders to Map via HttpUtils.asMap() on every response, allocating a throwaway HashMap$ValueIterator and rebuilding all HashMap$Node entries
  3. HttpHeaders in RxGatewayStoreModel.getHttpRequestHeaders() is undersized, causing internal HashMap resize
  4. Redundant toLowerCase() calls on header keys that are already normalized
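To make root cause 1 concrete, the resize arithmetic can be simulated with a standalone sketch (not SDK code): java.util.HashMap starts at capacity 16 with load factor 0.75 (threshold 12), so the 13th insert triggers the first doubling and the 25th the second.

```java
// Sketch: counts the resize/rehash cycles a default-sized java.util.HashMap
// performs while inserting n entries. Mirrors HashMap's rule that a resize
// happens once size exceeds capacity * loadFactor.
public final class HashMapResizeMath {
    static final float LOAD_FACTOR = 0.75f;  // HashMap's default load factor
    static final int DEFAULT_CAPACITY = 16;  // HashMap's default initial capacity

    /** Resize cycles incurred when putting n entries into new HashMap<>(). */
    public static int resizeCount(int n) {
        int capacity = DEFAULT_CAPACITY;
        int threshold = (int) (capacity * LOAD_FACTOR); // 12
        int resizes = 0;
        for (int inserted = 1; inserted <= n; inserted++) {
            if (inserted > threshold) { // table doubles; every node is rehashed
                capacity *= 2;
                threshold = (int) (capacity * LOAD_FACTOR);
                resizes++;
            }
        }
        return resizes;
    }

    public static void main(String[] args) {
        System.out.println(resizeCount(20)); // 1 resize for a 20-header response
        System.out.println(resizeCount(30)); // 2 resizes for a 30-header response
    }
}
```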

Changes

  1. Right-sized HashMap initial capacity: HashMap<>(32) instead of HashMap<>() in RxDocumentServiceRequest, and mapCapacityForSize() helper in HttpUtils to avoid rehashing
  2. Eliminate HashMap to HttpHeaders to HashMap round-trip: StoreResponse now accepts HttpHeaders directly, removing intermediate asMap() conversion that created throwaway HashMap$ValueIterator and HashMap$Node arrays
  3. Pre-sized HttpHeaders in RxGatewayStoreModel: sized to defaultHeaders.size() + headers.size() to avoid internal HashMap resize
  4. Remove redundant toLowerCase() calls: HttpHeaders.set() already normalizes keys; callers no longer double-normalize creating extra String objects
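A minimal sketch of the mapCapacityForSize() helper from change 1, using the n * 4 / 3 + 1 integer formula quoted in the commit messages (the exact SDK signature and class placement may differ):

```java
import java.util.HashMap;
import java.util.Map;

public final class CapacityHelper {
    /** Initial capacity whose 0.75-load-factor threshold covers n entries,
     *  so inserting n entries never triggers a resize. */
    public static int mapCapacityForSize(int n) {
        return n * 4 / 3 + 1; // integer inverse of 0.75, +1 to absorb rounding
    }

    public static void main(String[] args) {
        int n = 24; // e.g. a typical gateway response header count
        Map<String, String> headers = new HashMap<>(mapCapacityForSize(n));
        for (int i = 0; i < n; i++) {
            headers.put("x-header-" + i, "value-" + i); // no resize along the way
        }
        System.out.println(mapCapacityForSize(24)); // prints 33
    }
}
```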

Benchmark Results

Test matrix: 1 tenant x {c1, c8, c16, c32, c128} concurrency x {Read, Write} x {HTTP/1, HTTP/2} x 3 rounds each, GATEWAY mode, 10 min/run.

Throughput Summary (ops/s, 3-round average ± stddev)

| Config | Conc | main (baseline) | hashmap-alloc (PR) | Delta |
| --- | --- | --- | --- | --- |
| Read/HTTP1 | c1 | 433 ±41 | 460 ±37 | +6.1% |
| Read/HTTP1 | c8 | 4,897 ±135 | 4,971 ±108 | +1.5% |
| Read/HTTP1 | c16 | 7,639 ±680 | 7,305 ±171 | -4.4%* |
| Read/HTTP1 | c32 | 21,297 ±1,476 | 19,913 ±329 | -6.5%* |
| Read/HTTP1 | c128 | 54,528 ±1,555 | 54,223 ±1,462 | -0.6% |
| Read/HTTP2 | c1 | 414 ±36 | 408 ±39 | -1.4% |
| Read/HTTP2 | c8 | 4,866 ±453 | 4,659 ±67 | -4.3%* |
| Read/HTTP2 | c16 | 6,974 ±156 | 6,884 ±150 | -1.3% |
| Read/HTTP2 | c32 | 19,553 ±1,724 | 18,488 ±144 | -5.4%* |
| Read/HTTP2 | c128 | 47,133 ±393 | 48,856 ±650 | +3.7% |
| Write/HTTP1 | c1 | 179 ±1 | 170 ±1 | -5.2% |
| Write/HTTP1 | c8 | 1,676 ±9 | 1,726 ±41 | +3.0% |
| Write/HTTP1 | c16 | 3,138 ±88 | 3,131 ±97 | -0.2% |
| Write/HTTP1 | c32 | 7,302 ±178 | 7,301 ±234 | -0.0% |
| Write/HTTP1 | c128 | 13,628 ±15 | 13,643 ±34 | +0.1% |
| Write/HTTP2 | c1 | 160 ±0 | 159 ±2 | -0.2% |
| Write/HTTP2 | c8 | 1,652 ±47 | 1,619 ±2 | -2.0% |
| Write/HTTP2 | c16 | 3,055 ±68 | 2,969 ±94 | -2.8% |
| Write/HTTP2 | c32 | 7,031 ±228 | 7,024 ±232 | -0.1% |
| Write/HTTP2 | c128 | 13,648 ±24 | 13,664 ±5 | +0.1% |

Variance Analysis

The apparent -4% to -6% deltas at mid-concurrency (c16/c32) are not SDK regressions; they are caused by server-side transit time variability between rounds.

A dedicated 6-round reproducibility study (1t-c32-ReadThroughput-http1) with request-level metrics enabled confirms this:

| Metric | main (6 rounds) | hashmap-alloc (6 rounds) |
| --- | --- | --- |
| Avg throughput | 21,346 ops/s | 19,793 ops/s |
| Stddev | 1,541 | 352 |
| CV (coefficient of variation) | 7.2% | 1.8% |

The request-level breakdown shows the variance lives entirely in transitTime (server round-trip), not in SDK-side processing:

| Round | main ops/s | main transitTime (ms) | hashmap ops/s | hashmap transitTime (ms) |
| --- | --- | --- | --- | --- |
| r1 | 20,021 | 1.346 | 20,136 | 1.343 |
| r2 | 19,417 | 1.406 | 20,226 | 1.350 |
| r3 | 22,905 | 1.141 | 19,476 | 1.404 |
| r4 | 22,952 | 1.141 | 20,045 | 1.355 |
| r5 | 20,020 | 1.353 | 19,538 | 1.396 |
| r6 | 22,763 | 1.144 | 19,335 | 1.411 |
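The quoted CV figures can be reproduced from the per-round throughputs in this table (using population standard deviation, which matches the stated 1,541 and 352):

```java
// Computes the coefficient of variation (population stddev / mean) from
// the per-round throughput numbers reported in the table above.
public final class CvCheck {
    public static double cv(double[] samples) {
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= samples.length;
        double var = 0;
        for (double s : samples) var += (s - mean) * (s - mean);
        var /= samples.length;        // population variance
        return Math.sqrt(var) / mean; // coefficient of variation
    }

    public static void main(String[] args) {
        double[] mainRounds    = {20021, 19417, 22905, 22952, 20020, 22763};
        double[] hashmapRounds = {20136, 20226, 19476, 20045, 19538, 19335};
        System.out.printf("main CV    = %.1f%%%n", 100 * cv(mainRounds));    // ~7.2%
        System.out.printf("hashmap CV = %.1f%%%n", 100 * cv(hashmapRounds)); // ~1.8%
    }
}
```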

hashmap-alloc has 4x lower CV (1.8% vs 7.2%).

Root cause of transit time variance (confirmed via TCP socket analysis): The bimodal pattern is caused by Azure Traffic Manager (ATM) routing the regional Cosmos endpoint (benchmark-cosmos-lx1-westus2.documents.azure.com) to different frontend nodes on each JVM restart. This was confirmed by running `ss -i -t -n state established '( dport = :443 )'` mid-run across 6 rounds: all 32 connections land on the same IP within each round, and that IP alternates perfectly with throughput: 20.9.156.133 (fe2, slow, ~19.3K ops/s) vs 20.42.170.147 (fe7, fast, ~20.2K ops/s). The ATM CNAME record has TTL=20s, so each JVM restart resolves to whichever frontend ATM selects at that moment. This is infrastructure variability unrelated to SDK code.

GC Comparison (c128 Read HTTP/1, r1)

| Metric | main | hashmap-alloc |
| --- | --- | --- |
| GC pause count | 817 | 813 |
| Mean pause | 2.36 ms | 2.38 ms |
| P99 pause | 7.40 ms | 7.66 ms |
| Total pause time | 1,929 ms | 1,935 ms |

GC behavior is identical between branches. At single-tenant scale with an 8 GB heap, the allocation reduction does not materially change GC frequency or pause time. The benefit is reduced unnecessary work (fewer resize/rehash cycles, fewer throwaway iterators) which would compound at higher tenant density.

JFR Allocation Comparison All Configs

ObjectAllocationSample comparison for aggregate allocation share of all 9 targeted classes.

Note on HashMap$ValueIterator: This PR eliminates the response-side HttpUtils.asMap() iterator. A separate HashMap$ValueIterator still exists on the request-sending side (ReactorNettyClient.bodySendDelegate); this is expected and not targeted by this PR.

| Config | main targeted % | hashmap-alloc targeted % | Delta (pp) |
| --- | --- | --- | --- |
| c1-Read/http1 | 11.7% | 14.4% | +2.7 |
| c8-Read/http1 | 22.7% | 10.6% | -12.1 |
| c16-Read/http1 | 9.2% | 14.1% | +4.9 |
| c32-Read/http1 | 11.2% | 12.8% | +1.7 |
| c128-Read/http1 | 20.4% | 17.4% | -3.0 |
| c1-Read/http2 | 11.4% | 10.4% | -1.1 |
| c8-Read/http2 | 11.9% | 7.1% | -4.8 |
| c16-Read/http2 | 9.1% | 9.0% | -0.1 |
| c32-Read/http2 | 14.6% | 10.5% | -4.1 |
| c128-Read/http2 | 16.9% | 15.7% | -1.1 |
| c1-Write/http1 | 11.2% | 3.5% | -7.7 |
| c8-Write/http1 | 15.2% | 20.3% | +5.0 |
| c16-Write/http1 | 8.0% | 17.2% | +9.2 |
| c32-Write/http1 | 17.7% | 22.2% | +4.5 |
| c128-Write/http1 | 16.5% | 10.1% | -6.5 |
| c1-Write/http2 | 9.1% | 6.2% | -2.9 |
| c8-Write/http2 | 15.7% | 18.7% | +2.9 |
| c16-Write/http2 | 16.0% | 12.3% | -3.7 |
| c32-Write/http2 | 18.0% | 11.8% | -6.2 |
| c128-Write/http2 | 8.5% | 13.1% | +4.6 |

Note on JFR sampling noise: Individual per-config percentages can swing +/-5pp between runs. The consistently observable patterns are:

  1. HashMap$ValueIterator is eliminated in most configs (the asMap() round-trip is removed)
  2. At high concurrency (c128), targeted allocation share drops consistently (Read/HTTP1: 20.4% → 17.4%, Write/HTTP1: 16.5% → 10.1%)

Detailed breakdown for c128 Read HTTP/1 (highest pressure, most stable JFR signal):

| Class | main | hashmap-alloc | Change |
| --- | --- | --- | --- |
| HashMap$Node | 6.9% | 5.2% | -1.7pp |
| HashMap$ValueIterator | 1.3% | 0.0% | eliminated |
| DefaultHeaders$HeaderEntry | 6.8% | 4.4% | -2.4pp |
| DefaultHeadersImpl | 1.3% | 0.04% | -1.3pp |
| HttpHeader | 0.9% | 0.4% | -0.5pp |

(Charts: JFR allocation comparison, summary chart, summary throughput.)

30-Tenant Benchmark Results

Test matrix: 30 tenants x {c3, c5} concurrency-per-tenant x {Read, Write} x {HTTP/1, HTTP/2}, GATEWAY mode, single cycle (~10 min steady-state). Metrics reported from tenant 0 (representative).

| Config | main ops/s | PR ops/s | Delta | main p95 | PR p95 | main p99 | PR p99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 30t-c3 Read/HTTP1 | 1,211 | 1,182 | -2.4% | 3.11ms | 3.15ms | 5.70ms | 5.72ms |
| 30t-c3 Read/HTTP2 | 1,092 | 1,102 | +0.9% | 3.42ms | 3.54ms | 6.49ms | 6.70ms |
| 30t-c3 Write/HTTP1 | 558 | 559 | +0.2% | 6.18ms | 6.18ms | 7.48ms | 7.99ms |
| 30t-c3 Write/HTTP2 | 534 | 504 | -5.6% | 6.24ms | 7.00ms | 8.35ms | 9.96ms |
| 30t-c5 Read/HTTP1 | 1,586 | 1,587 | +0.1% | 4.13ms | 4.15ms | 6.41ms | 6.46ms |
| 30t-c5 Read/HTTP2 | 1,394 | 1,353 | -2.9% | 4.98ms | 5.24ms | 7.79ms | 7.97ms |
| 30t-c5 Write/HTTP1 | 894 | 937 | +4.8% | 6.69ms | 6.16ms | 9.29ms | 7.72ms |
| 30t-c5 Write/HTTP2 | 853 | 842 | -1.3% | 7.11ms | 7.07ms | 10.52ms | 10.09ms |

7 of 8 configs are within 3% throughput. The 30t-c3-Write/HTTP2 config shows -5.6%, within ATM-induced between-run variance (runs were separated by ~4 hours, so ATM may route to a different frontend). The 30t-c5-Write/HTTP1 config shows +4.8% throughput with p99 reduced from 9.29ms to 7.72ms (-17%).

Conclusion

  • Throughput: neutral overall (within ~3% across nearly all configs), consistent with measurement noise
  • Variance: apparent regressions at c16/c32 are server-side ATM routing variability, confirmed via TCP socket inspection (`ss -i`); hashmap-alloc has 4x lower throughput CV (1.8% vs 7.2%)
  • GC: identical (817 vs 813 pauses, same mean/p99)
  • Allocation efficiency: HashMap$ValueIterator eliminated; HashMap$Node -23%, DefaultHeaders$HeaderEntry -35% at c128
  • 30-tenant: neutral across 8 configs (7/8 within 3%); 30t-c5-Write/HTTP1 shows +4.8% with -17% p99
  • The changes remove unnecessary allocation overhead without regression. The benefit compounds at higher tenant density where allocation pressure and GC become bottlenecks.

Eliminate per-response intermediate HashMap allocation by adding a new
StoreResponse constructor that accepts HttpHeaders directly. Header names
and values are populated into String[] arrays without materializing an
intermediate Map. The JsonNodeStorePayload is updated to accept header
arrays and only builds a Map lazily on error paths (extremely rare).
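An illustrative sketch of the pattern this commit describes, using a plain Map as a stand-in for HttpHeaders (the class and field names here are hypothetical, not the SDK's actual StoreResponse):

```java
// Sketch: copy header names/values straight into parallel String[] arrays,
// skipping the intermediate HashMap, and build a Map only lazily (the PR
// reserves that for rare error paths).
import java.util.LinkedHashMap;
import java.util.Map;

public final class HeaderArraysSketch {
    final String[] names;
    final String[] values;

    HeaderArraysSketch(Map<String, String> httpHeaders) { // stand-in for HttpHeaders
        names = new String[httpHeaders.size()];
        values = new String[httpHeaders.size()];
        int i = 0;
        for (Map.Entry<String, String> e : httpHeaders.entrySet()) {
            names[i] = e.getKey();   // already lowercase per HttpHeaders.set()
            values[i] = e.getValue();
            i++;
        }
    }

    /** Lazily materialize a Map only when actually needed. */
    Map<String, String> toMap() {
        Map<String, String> map = new LinkedHashMap<>(names.length * 4 / 3 + 1);
        for (int i = 0; i < names.length; i++) {
            map.put(names[i], values[i]);
        }
        return map;
    }
}
```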

Pre-size HashMaps throughout the hot path to avoid resize/rehash:
- HttpHeaders request construction: sized to defaultHeaders + request headers
- StoreResponse.replicaStatusList: pre-sized to 4
- StoreResponse.withRemappedStatusCode: pre-sized to header count
- RxDocumentServiceRequest fallback maps: pre-sized to 32

Fix HttpUtils.asMap() double-allocation by iterating HttpHeaders directly
instead of calling toMap() which creates an intermediate HashMap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the Cosmos label Apr 1, 2026
Annie Liang and others added 2 commits April 1, 2026 09:53
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```diff
@@ -51,14 +52,14 @@ public static String urlDecode(String url) {

 public static Map<String, String> asMap(HttpHeaders headers) {
     if (headers == null) {
-        return new HashMap<>();
+        return new HashMap<>(4);
     }
     HashMap<String, String> map = new HashMap<>(headers.size());
```
Member
You also should make this instantiation

Suggested change:

```diff
-HashMap<String, String> map = new HashMap<>(headers.size());
+HashMap<String, String> map = new HashMap<>(((int) headers.size() / 0.75F) + 1);
```

As internally, HashMap will resize once it hits a capacity factor of 0.75. Meaning this conversion has a map resize happening.

Member Author
good catch, yea will change, thanks ~~

Member
Depending on how many locations start doing this, may want to add in a helper method for this.

xinlian12 and others added 3 commits April 4, 2026 22:02
…ve null-guard inconsistency

- Fix HashMap<>(4) to HashMap<>(6) for replicaStatusList to avoid rehash
  at 4 replicas (capacity 4 * 0.75 = threshold 3, resizes on 4th insert)
- Refactor JsonNodeStorePayload: extract shared parseJson() method with
  Supplier<Map<String,String>> to eliminate duplicated error-handling logic
- Remove misleading null ternary in getHttpRequestHeaders() since
  getHeaders() always returns non-null (fallback HashMap<>(32))
- Revert HashMap<>(16) to HashMap<>() in HttpHeaders default constructor
  (16 is already the default capacity, change was no-op noise)
- Add unit tests for StoreResponse HttpHeaders constructor, HttpHeaders
  populateLowerCaseHeaders, and JsonNodeStorePayload array-header constructor
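The capacity arithmetic behind the HashMap<>(4) to HashMap<>(6) fix above can be checked with a sketch of HashMap's power-of-two rounding and 0.75 threshold rules (mirrors java.util.HashMap behavior; not SDK code):

```java
// Sketch of java.util.HashMap's capacity rounding and resize threshold.
public final class ReplicaMapCapacity {
    /** Round a requested capacity up to the next power of two (HashMap behavior). */
    static int tableSizeFor(int cap) {
        if (cap <= 1) return 1;
        int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
        return n + 1;
    }

    /** Max entries held before the first resize, given a constructor argument. */
    static int entriesBeforeResize(int requestedCapacity) {
        return (int) (tableSizeFor(requestedCapacity) * 0.75f);
    }

    public static void main(String[] args) {
        System.out.println(entriesBeforeResize(4)); // 3 -> resizes on the 4th replica
        System.out.println(entriesBeforeResize(6)); // 6 -> holds 4 replicas, no resize
    }
}
```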

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix HashMap<>(headers.size()) in HttpUtils.asMap() to account for the
0.75 load factor, avoiding resize when all headers are inserted.

Extract mapCapacityForSize(int) helper in HttpUtils to consolidate the
capacity calculation (n * 4 / 3 + 1) used across HttpUtils.asMap(),
StoreResponse.withRemappedStatusCode(), and JsonNodeStorePayload.buildHeaderMap().

Addresses review feedback from alzimmermsft.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ize to HttpHeaders, clarify lowercase key guarantee

- Add comments explaining why HashMap<>(32) is used as fallback in
  RxDocumentServiceRequest: capacity 32 gives threshold 24, covering
  typical 15-20 request headers without resize.
- Apply HttpUtils.mapCapacityForSize() in RxGatewayStoreModel.getHttpRequestHeaders()
  to account for 0.75 load factor when constructing HttpHeaders.
- Make mapCapacityForSize() public so it can be used from other packages.
- Document in populateLowerCaseHeaders() Javadoc that keys are guaranteed
  lowercase because HttpHeaders.set() stores them via toLowerCase(Locale.ROOT).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xinlian12 and others added 8 commits April 6, 2026 14:14
Extract duplicate contentStream/payload handling from both StoreResponse
constructors into a shared parseResponsePayload() static helper method.
Both constructors now use the array-based JsonNodeStorePayload constructor,
eliminating code duplication while preserving identical behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove both unescape(Set<Entry>) and unescape(Map) overloads from
  HttpUtils as they are no longer needed
- Update ResponseUtils to use the HttpHeaders-based StoreResponse
  constructor (same optimization as RxGatewayStoreModel)
- Remove unescape test from HttpUtilsTest, keep asMap() coverage
- Clean up unused imports (AbstractMap, ArrayList, List, Set)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace previous benchmark charts with comprehensive V3 analysis:
- 20 configs (c1/c8/c16/c32/c128 x Read/Write x HTTP1/HTTP2)
- 3 rounds each, 10 min/run, GATEWAY mode
- Timeline charts with throughput and P99 latency
- JFR allocation breakdown comparison
- Detailed per-round analysis of outlier patterns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
JFR ObjectAllocationSample weight = estimated cumulative bytes allocated
over the recording (10 min), not heap residency. Heap was 8 GB committed.
The ~271 GB 'targeted' figure is allocation throughput (~4 GB/s allocation
rate); most objects are immediately GC'd.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace confusing cumulative GB with allocation share %
- Add GC comparison table (817 vs 813 pauses - identical)
- Frame as code efficiency improvement, not GC impact

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…arison

- Remove timeline charts (not needed for review)
- Add variance analysis using request-level metrics (transitTime)
  showing variance is server-side, not SDK-related
- Add JFR allocation comparison for all 20 configs
- Keep summary bar chart and c128 JFR chart

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The c16 PR run showing HashMap$ValueIterator is from the request-sending
path (ReactorNettyClient.bodySendDelegate iterating request headers),
NOT the response-side asMap() iterator we eliminated. Added clarifying
note and removed the per-config ValueIterator column (too noisy).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>